iWarp: An Integrated Solution to High-Speed Parallel Computing 



Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross,
H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper,
Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb

Department of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213

Intel Corporation, JF1-60
5200 N.E. Elam Young Pkwy
Hillsboro, Oregon 97124



Abstract 

iWarp is a system architecture for high speed signal, image
and scientific computing. The heart of an iWarp system is the
iWarp component: a single chip processor that requires only
the addition of memory chips to form a complete system
building block, called the iWarp cell. Each iWarp component
contains both a powerful computation engine (20 MFLOPS)
and a high throughput (320 MBytes/sec), low latency
(100-150 ns) communication engine for interfacing with other
iWarp cells. Because of its strong computation and
communication capabilities, the iWarp component is a versatile
building block for various high performance parallel systems.
These systems range from special purpose systolic arrays to
general purpose distributed memory computers. They are
able to support both fine-grain parallel and coarse-grain
distributed computation models simultaneously in the same
system. An iWarp system can include a large number of cells;
the initial iWarp demonstration system consists of an 8x8
torus of iWarp cells, delivering more than 1.2 GFLOPS. It
can be expanded to include up to 1,024 cells. This paper
describes the iWarp architecture and how it supports various
communication models and system configurations.

1. Introduction

iWarp is a product of a joint effort between Carnegie Mellon
University and Intel Corporation. The goal of the effort is
to develop a powerful building block for various distributed 
memory parallel computing systems and to demonstrate its 
effectiveness by building actual systems. The building block 
is a custom VLSI single chip processor, called iWarp, which 
consists of approximately 600,000 transistors. 

The iWarp component contains both a powerful computa- 
tion processor (20 MFLOPS) and a high throughput (320 
MBytes/sec), low latency (100-150 ns) communication engine.
Using nonpipelined floating-point units, the computation
processor will sustain high computation speed for
vectorizable as well as non-vectorizable codes.
An iWarp component connected to a local memory forms
an iWarp cell; up to 64 MBytes of memory are directly
addressable. A large array of iWarp cells will deliver an 
enormous computing bandwidth never before realized in dis- 
tributed memory parallel systems. Because of the strong 
computation and communication capabilities and because of 
its commercial availability, iWarp is expected to be an impor- 
tant building block for a diverse set of high performance 
parallel systems. 

The iWarp architecture evolved from the Warp 
machine [1], a programmable systolic array developed at
Carnegie Mellon and produced by General Electric. All
applications of Warp, including low-level vision, signal processing,
and neural network simulation [2, 18], can run efficiently on 
iWarp. But systems made of the iWarp building block can 
achieve at least one order of magnitude improvement over 
Warp in cost, reliability, power consumption, and physical 
size. Much larger arrays can be easily built. The clock speed 
of iWarp is twice as high as Warp; the increase in computa- 
tion throughput is matched by a similar increase in I/O 
bandwidth. Therefore we expect iWarp to achieve the same 
high efficiency as Warp. For example, the NETtalk neural 
network benchmark [20] runs at 16.5 million connections per
second and 70 MFLOPS on a 10 cell Warp array; the same
benchmark runs at 36 million connections per second and 153
MFLOPS on an iWarp array of the same number of cells.

Although the design of the iWarp architecture profited 
greatly from programming and applications experiences 
gained from many Warp machines in the field, iWarp is not 
just a straightforward VLSI implementation of Warp. iWarp
is intended to have a much broader domain of applications
than Warp. The following summarizes the goals of
iWarp as a system building block:

• iWarp is useful for the implementation of both special 
purpose arrays, which require high computation and I/O 
bandwidth, and general purpose arrays where
programmability and programming support are essential.

• iWarp is useful for both high performance processors 
attached to general purpose hosts and autonomous 
processor arrays capable of performing all the computation
and I/O by themselves. That is, iWarp can be used
for both "host centric" and "array centric" processing.

• iWarp supports both tightly and loosely coupled parallel
processing, and both systolic [12] and message passing
models of communication.

• iWarp can implement a variety of processor interconnection
topologies including 1-dimensional (1D) arrays,
rings, 2-dimensional (2D) arrays, and tori.

• iWarp is intended for systems of various sizes ranging
from several processors to thousands of processors. 

This paper will explain how the iWarp architecture addresses 
these objectives. 

Besides conventional high level languages such as C and 
FORTRAN, the programming of iWarp arrays will be supported
by programming tools such as parallel program
generators. Previous experience and current research on 
Warp indicates that parallel program generators are one of the 
most promising approaches to programming distributed 
memory parallel computers. In this approach, a specialized,
machine independent language is created, which embodies a
particular parallel computation model (for example, input
partitioning, domain partitioning, or task-queuing [13]). The
compiler for that language then maps the program onto a 
target parallel architecture. This approach can allow efficient 
parallel programs to be generated automatically for large 
processor arrays. 

Automatic parallel program generators have been 
developed for iWarp in two applications areas: scientific com- 
puting and image processing. The scientific computing lan- 
guage, called AL (Array Language) [21], incorporates the 
domain partitioning model and allows programmers to transfer
data between a common space and a partitioned space,
perform computation in parallel in the partitioned space, and
then transfer data back. For scientific routines such as those
found in LINPACK [5], the AL compiler generates efficient
code for iWarp and Warp, as well as for uniprocessors. The
image processing language, called Apply [9], incorporates the 
input partitioning model: the input images are partitioned 
among the processors, each of which generates part of the 
corresponding output image. Apply compilers exist for 
iWarp, Warp, uniprocessors, and the Meiko Computing Sur- 
face, as well as several other computer architectures. 
Benchmark comparisons on Apply programs have validated 
the above claims [22]. 

As of July 1988, the architecture and logic designs for 
iWarp have been completed. In the software area, an optimizing
compiler developed for Warp [8, 16] has been retargeted
to generate code for iWarp. Using this compiler, the iWarp
performance on real programs, including those generated by
the parallel program generators mentioned above, has been
evaluated on an iWarp architecture simulator. A prototype 
iWarp system is expected to be operational by the end of 
1989. Three demonstration systems, each consisting of an 
8x8 torus of iWarp cells, are scheduled to be operational in 
the middle of 1990. 

The organization of this paper is as follows. In the next 
section, we give an overview of the iWarp component and how
iWarp systems can be constructed from iWarp cells. Some sample iWarp
usages and system configurations are described in Section 3. 
Sections 4 and 5 deal with one of the most innovative features 
of the iWarp architecture: the iWarp intercell communication
models and mechanism. The computation part of the
iWarp component is discussed in Section 6, which also
includes some preliminary iWarp performance figures on the
Livermore Loops benchmark. Finally, a summary of the
paper and some concluding remarks are given.



2. iWarp overview

An iWarp system is composed of a collection of iWarp 
cells, each of which consists of an iWarp component and its
local memory. This section first gives an overview of the 
iWarp component and then summarizes how an iWarp cell 
physically interfaces with the external world so that various 
iWarp systems can be constructed. 

2.1. iWarp component

The iWarp component has a communication agent and a 
computation agent, as depicted in Figure 1.

Figure 1. iWarp component overview

The computation agent can carry out computations independently from the
operations being performed at the communication agent.
Therefore a cell may perform its computation while
communication through the cell from and to other cells is taking
place, and the cell program does not need to be involved with 
the communication. While separating the control of the two 
agents makes programming easy, having the two agents on 
the same chip allows them to cooperate in a tightly coupled 
manner. The tight coupling allows several communication 
models to be implemented efficiently, as discussed in
Section 4. The major blocks in iWarp are shown in Figure 2.

Figure 2. Major functional units in iWarp

In the following we summarize the major features in the 
two agents and their interface. The performance numbers are 
based on the expected clock speed of 20 MHz, i.e., a clock is
50 ns. Further discussions on iWarp communication and
computation features are provided in Sections 5 and 6,
respectively.

Communication agent

Four input and four output ports 

• 40 MBytes/sec data bandwidth per port

• Word by word, hardware flow control at each port

• An output port can be connected to an input port of
another iWarp cell via a point-to-point physical bus.

Multiple logical busses multiplexed on each physical bus.

• Maintaining up to 20 incoming pathways simultaneously
in an iWarp component

• Idle logical busses do not consume any bandwidth of the 
physical bus. 

Pathway unit 

• Routing for 1D and 2D configurations

• Capable of implementing wormhole and streetsign routing 
schemes 

Both message passing and systolic communication are sup- 
ported for coarse-grain and fine-grain parallel computation. 

Computation agent 

Computational units 

• Floating-point adder

  • 10 and 5 MFLOPS for 32- and 64-bit additions (IEEE
    754 standard), respectively

  • Nonpipelined

• Floating-point multiplier

  • 10 and 5 MFLOPS for 32- and 64-bit multiplications
    (IEEE 754 standard), respectively

  • Nonpipelined

  • Full divide, remainder, and square root support

• Integer/logical unit

  • 20 MIPS peak performance on 8/16/32-bit
    integer/ordinal data

  • Arithmetic, logical, and bit operations

All the above three units may be scheduled to operate in 
parallel in one instruction, generating a peak computing rate 
of 20 MFLOPS plus 20 MIPS. 

Internal data storage and interconnect 

• A shared, multiported, 128 word register file 

• Special register file locations for local memory and com- 
munication agent access 

Memory units 

• Off-chip local memory for data and instructions

  • Separate address and data busses (24-bit word address
    bus, 64-bit data bus)

  • 20 million memory accesses/sec peak performance

  • 160 MBytes/sec peak memory bandwidth

  • Read, write, and read/modify/write support

• On-chip program store

  • 256 word cache RAM

  • 2K word ROM (built-in functions)

  • 32- and 96-bit instructions

Communication and computation interface

Communication agent notifies computation agent on message 
arrival. 

Dynamic flow control: Computation agent spins when read- 
ing from an empty queue or writing to a full queue in com- 
munication agent. 

Hardware spools data between queues and local memory. 



2.2. Forming iWarp systems

Various iWarp systems can be constructed with the iWarp 
cell. We describe how copies of the iWarp cell can be 
connected together, and how an iWarp cell can connect to 
peripherals to form these systems. 

There are two ways that an iWarp cell, consisting of an 
iWarp component and its local memory, physically interfaces 
with the external world. Recall that the iWarp component has 
four input ports and four output ports. The first interface
method is to use a physical bus to connect an output port of an
iWarp to an input port of another. The former and latter port
can write to and read from the bus, respectively. Thus this is
a unidirectional bus between the two components, as
represented by the arrowed edge in Figure 3 (a). Usually
another unidirectional bus in the opposite direction is also
provided, so that bidirectional data communication between
the two components is possible. This is illustrated in Figure 3
(b).

Figure 3. Intercell connection via ports
of iWarp components: (a) unidirectional bus and
(b) two unidirectional busses in opposite directions

The second interface method is via the local memory of the 
iWarp cell, as depicted by Figure 4.

Figure 4. Connection with peripherals via local memory
of an iWarp cell

Using this interface the
iWarp cell can reach peripherals such as standard busses,
disks, graphics devices and sensors. Therefore the iWarp 
cell's connection with peripherals uses the local memory, 
while its intercell connection uses ports of iWarp. Since these
two functions use different physical resources of the iWarp 
cell, they can be implemented independently from each other. 
This implies, for example, that peripherals can be attached to
any set of iWarp cells in an array of iWarp cells, independently
from the array interconnection topology. With these
two interface methods many system configurations can be 
implemented as will be shown in Section 3. 

3. iWarp usages and system configurations 

The iWarp cell, consisting of the iWarp component and 
local memory, is a building block for a variety of system 
configurations. These systems can be used as general and 
special-purpose computing engines. This section describes
some of these usages and system configurations. 

3.1. General purpose arrays 

With its four pairs of input and output ports, the iWarp cell 
is a convenient building block for a 2D array or torus. Figure 
5 depicts a 3x3 torus. Peripherals can be attached to any of
the iWarp cells via its local memory.

Figure 5. 3x3 torus

The initial demonstration
iWarp system in 1990 is an 8x8 torus, with a total of 32
MBytes SRAM. It has a peak performance of 1,280 
MFLOPS. The memory of each cell can be expanded up to
)5 MBytes, and with different memory components, a 
memory space of up to 64 MBytes per cell is possible. The 
same system design can be extended to a 32x32 torus, giving 
an aggregate peak performance of 20,480 MFLOPS.
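
As a consistency check, the peak figures quoted so far follow from a handful of per-cell parameters. The sketch below recomputes them. All constant and function names are illustrative, and the 4-byte word size is an assumption: the text gives only a 24-bit word address bus and a 64 MByte addressable limit, which a 32-bit word reconciles.

```python
# Back-of-the-envelope check of the peak figures quoted in the text.
CLOCK_HZ = 20_000_000        # expected clock speed (20 MHz)
PORT_MBYTES_PER_SEC = 40     # data bandwidth per port
PORTS = 4 + 4                # four input ports plus four output ports
CELL_MFLOPS = 10 + 10        # adder and multiplier, 32-bit operands
DATA_BUS_BYTES = 8           # 64-bit memory data bus
ADDRESS_BITS = 24            # 24-bit word address bus
WORD_BYTES = 4               # assumption: addresses count 32-bit words


def clock_period_ns(clock_hz: int) -> float:
    """Length of one clock in nanoseconds."""
    return 1e9 / clock_hz


def communication_mbytes_per_sec() -> int:
    """Aggregate bandwidth over all input and output ports of one cell."""
    return PORTS * PORT_MBYTES_PER_SEC


def memory_mbytes_per_sec() -> int:
    """Peak memory bandwidth, assuming one 64-bit access per clock."""
    return CLOCK_HZ * DATA_BUS_BYTES // 1_000_000


def addressable_mbytes() -> int:
    """Directly addressable local memory per cell, in units of 2**20 bytes."""
    return (2 ** ADDRESS_BITS) * WORD_BYTES // 2 ** 20


def array_peak_mflops(rows: int, cols: int) -> int:
    """Peak floating-point rate of a rows x cols array of cells."""
    return rows * cols * CELL_MFLOPS


if __name__ == "__main__":
    print(clock_period_ns(CLOCK_HZ), "ns per clock")
    print(communication_mbytes_per_sec(), "MBytes/sec communication per cell")
    print(memory_mbytes_per_sec(), "MBytes/sec memory per cell")
    print(addressable_mbytes(), "MBytes addressable per cell")
    print(array_peak_mflops(8, 8), "MFLOPS for the 8x8 torus")
    print(array_peak_mflops(32, 32), "MFLOPS for a 32x32 torus")
```

These are peak rates only; sustained performance depends on how well a program keeps the adder, the multiplier, and the ports busy at the same time.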

The iWarp cell is a building block for 1D arrays or rings as
well. Figure 6 depicts a 6-cell ring.

Figure 6. 6-cell ring

A 1D array or ring of a
moderate number of iWarp cells, delivering on the order of
hundreds of MFLOPS, can be an effective attached processor
to a workstation. This has been demonstrated by the 10-cell
Warp array. Using the same approach with iWarp, we will
achieve an order of magnitude improvement in
cost-performance over Warp. To meet the requirement of lower
cost (and lower performance) applications, one or a few
iWarp cells can also form a single-board accelerator for
low-end workstations or PCs.



3.2. Special-purpose arrays 

Many systolic algorithms can make effective use of large 
processor arrays for applications such as signal processing
and graphics [10]. With the iWarp cell various special- 
purpose arrays that execute only a predetermined set of these 
algorithms can easily be built. For example, a hexagonal 
array (as depicted in Figure 7) with unidirectional physical 
busses between cells can be built to execute some classical 
systolic algorithms for matrix operations [15]. For such an
array, sensors and array output ports may be connected to the 
local memories of a number of cells, so that I/O can be carried 
out in parallel. In areas such as high-speed signal processing, 
special-purpose arrays can effectively use hundreds or even 
thousands of iWarp cells.

Figure 7. Hexagonal array with unidirectional
physical busses between cells

In some systolic algorithms, cells
on the array boundary may execute a different function from
cells inside the array [7]. In this case, individual iWarp cells 
can be programmed to perform different functions according 
to their locations in the array. 

In general, the performance of special-purpose arrays made 
of iWarp cells will be comparable to that of those arrays made 
of custom hardware using similar VLSI technology. Al- 
though the iWarp array will probably have a larger physical 
size, it can be readily programmed to implement the target 
algorithms and will incur a much shorter development time. 

4. Communication models 

Interprocessor communication is an integral part of parallel 
computing on a distributed memory processor array. To 
balance the high numerical processing capability on the 
processor, iWarp must be equally efficient in communication. 
The development of efficient parallel software is simplified if 
the communication cost is low and can be estimated reliably. 

To motivate the communication agent design on iWarp, in 
this section we first describe two important communication 
models commonly used on distributed memory processor ar- 
rays: message passing and systolic communication. We will 
study the requirements to implement these models efficiently,
from the data transport level all the way to the integration of
communication with the data processing. We then describe a
set of unifying programming abstractions to support these 
models. The next section shows how they are supported by 
the iWarp communication facility, while meeting the
performance requirements of their usages.

4.1. Communication models

We have identified two communication models used for 
distributed memory parallel systems: message passing and 
systolic. They differ primarily in the granularity of
communication and computation. In message passing mode, as in
computer networks, the unit of processing is a complete
message. That is, a message is accumulated in the source cell
memory, transmitted (as a unit) to the destination cell, and
only when the full message is available in the local memory 
of the destination cell is it ready to be operated upon. Con- 
versely, in systolic mode [12], the unit of communication and 
processing can be as fine grained as a single word in a 
message. 

4.1.1. Message passing

Message passing is a commonly used model for coarse-grain
parallel computation. Processes at each cell operate
independently on the cell's local data and only occasionally 
communicate with other cells. The timing, order, and even 
the communication partner are often determined at run time. 
The dynamic nature makes certain communication overheads 
unavoidable, such as routing the message across the array and 
asynchronously invoking the answering party. To efficiently 
support message passing, we need the capabilities described 
below. 

Hardware support for 1D and 2D configurations. 1D
arrays and rings are easier to build than 2D arrays and tori.
However, computations on a 2D configuration can be more
efficient than those on a 1D configuration for large systems
with many cells. Suppose that there are n cells in the system.
Using a 2D configuration, not only is the distance between
cells reduced from O(n) to O(√n) hops, but the effective
bandwidth of the communication network is also increased since
a transfer takes up fewer hops.
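
The O(n) versus O(√n) distance claim can be made concrete with a short sketch. It assumes bidirectional links, shortest-path routing, and a square torus; the function names are illustrative and not part of iWarp.

```python
import math


def ring_max_hops(n: int) -> int:
    """Worst-case shortest distance between two cells on an n-cell ring."""
    return n // 2


def torus_max_hops(n: int) -> int:
    """Worst-case shortest distance on a square torus holding n cells."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch only handles square tori")
    # The worst case is half-way around the torus in each dimension.
    return 2 * (side // 2)


if __name__ == "__main__":
    for cells in (64, 1024):
        print(cells, "cells:", ring_max_hops(cells), "hops on a ring,",
              torus_max_hops(cells), "hops on a torus")
```

Under these assumptions, the worst-case distance across 1,024 cells drops from 512 hops on a ring to 32 hops on a 32x32 torus.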

Spooling. Suppose a process wants to send a message to a 
destination cell. Since the communication network is shared, 
a process does not have guaranteed instantaneous access.
Ideally, the sender process can simply specify the destination 
and message location, and continue with its processing 
regardless of the availability of the network, and a separate 
thread of control spools the data out of the memory. 
Similarly, another spooling process can take the data from the 
network and store it into the receiver's memory, with minimal
interference with the computational process in progress. 

Separate communication support hardware. In an iWarp
array, a message may be routed through intermediate cells 
before reaching the final destination. The routing of data 
through a cell is logically unrelated to the process local to the 
cell. It can be supported in dedicated hardware to handle the 
high bandwidth of the array. 

Word-level synchronization. Although the granularity of 
communication is a message, this does not mean that the
intermediate hops should forward the data at the same grain 
size. In wormhole routing [3], the routing information in the 
header of the message can be used to set up the next leg of the 
communication path even before the rest of the data arrives. 
The contents of the message can be forwarded word by word, 
without having to be buffered in entirety on intermediate 
cells. Wormhole routing reduces the latency of communica- 
tion and does not take up any of the memory bandwidth of the 
intermediate cells. Since the communication path is built link 
by link, a word-by-word handshake is necessary to throttle the 
data flow in case the next link is temporarily unavailable. 
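
The latency benefit of forwarding word by word can be illustrated with a simple cycle-count model. It is a sketch under stated assumptions, not a description of iWarp timing: every link moves one word per cycle, there is no contention, and the message length includes the header. The function names are hypothetical.

```python
def store_and_forward_cycles(hops: int, message_words: int) -> int:
    """Each intermediate cell buffers the entire message before forwarding."""
    return hops * message_words


def wormhole_cycles(hops: int, message_words: int) -> int:
    """The header takes `hops` cycles to arrive; the remaining words follow
    it through the pipeline, one word per cycle."""
    return hops + message_words - 1


if __name__ == "__main__":
    hops, words = 8, 101  # e.g. one header word plus 100 data words
    print("store-and-forward:", store_and_forward_cycles(hops, words))
    print("wormhole:", wormhole_cycles(hops, words))
```

In this model the store-and-forward latency grows with the product of distance and message length, whereas the wormhole latency grows with their sum.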



Multiplexing messages on a physical bus. Since wormhole 
routing may use up multiple links at the same time, preven- 
tion of deadlock is necessary in ring or torus architectures. 
One scheme to prevent deadlock is the virtual channel 
method, which uses another set of links when routing beyond 
a certain cell [4, 19]. If in each direction only one physical 
bus is available between two connecting cells, this deadlock
prevention scheme requires that multiple communication
paths be multiplexed on a physical bus. Multiplexing can also 
be used to keep a long message from monopolizing the physi- 
cal bandwidth for an indefinitely long time. 
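
To illustrate why a second set of links breaks deadlock on a ring, the sketch below routes messages clockwise on a unidirectional ring and switches from channel set 0 to channel set 1 after a message passes cell 0. This is a simplified illustration of the virtual channel idea, not the scheme iWarp uses; the choice of cell 0 as the switching point and all names are assumptions made for the example.

```python
def ring_route(src, dst, n, use_virtual_channels=True):
    """Return the (from_cell, to_cell, channel_set) hops of a clockwise route."""
    hops = []
    cell, channel_set = src, 0
    while cell != dst:
        nxt = (cell + 1) % n
        hops.append((cell, nxt, channel_set))
        if use_virtual_channels and nxt == 0:
            channel_set = 1  # beyond cell 0, use the other set of links
        cell = nxt
    return hops


def dependency_edges(n, use_virtual_channels=True):
    """Collect 'holds channel a while requesting channel b' pairs over all routes."""
    edges = set()
    for src in range(n):
        for dst in range(n):
            route = ring_route(src, dst, n, use_virtual_channels)
            edges.update(zip(route, route[1:]))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the channel dependency graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


if __name__ == "__main__":
    print("single channel set, cyclic:", has_cycle(dependency_edges(6, False)))
    print("two channel sets, cyclic:", has_cycle(dependency_edges(6, True)))
```

A cycle in the dependency graph means a set of messages can each hold one link while waiting for the next, which is the deadlock condition; with the second channel set the graph in this model has no cycle.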

Door-to-door message passing. When a message arrives at 
the destination cell it is generally first buffered in a system 
memory space and then copied into the user's memory space. 
This extra copy can be eliminated if the data is stored directly 
into the desired memory location. Using the data throttling 
mechanism above, the receiving process can first examine the 
message header to determine the memory address for the 
message. We call this scheme of shipping data directly from 
a sender's data structures to a receiver's data structures, with- 
out any system memory buffering, door-to-door message 
passing. 

4.1.2. Systolic communication 

Systolic communication supports efficient, fine-grain
parallelism. In this model, the source cell program sends data
items to the destination cell as it generates them, and the 
destination cell program can start processing the data as soon 
as the first word of input has arrived. For example, the
outputs of an adder in one cell can be used as operands to the
adder of another, without going through the memories of
either cell, in a matter of several clocks. This mode provides
tight coupling and synchronization between cooperating 
processes. 

Systolic algorithms rely on the ability to transfer long 
streams of intermediate data between processes at high 
throughput and with low latency. More importantly, the
communication cost must be consistently small, because cost
variations can greatly increase delays in the overall
computation. This implies that dedicated communication paths are
desirable, which may be neighboring or non-neighboring 
paths depending on communication topologies of the algo- 
rithm. 

Raw data words are sent along a communication path, iden- 
tified only by their ordering in the data stream. The sender 
appends data to the end of the data stream and the receiver 
must access the words in the order they arrive. Our ex- 
perience with the Warp systolic array [1] shows that FIFO 
queuing along a communication path is useful in relaxing the 
coupling between the sender and the receiver. A sender does 
not need to wait for the receiver unless the queue is full;
similarly, the receiver can process the queued data until they
run out. Word-level synchronization is provided by stalling a
process that tries to read from an empty queue or write to a
full queue. 
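
The stalling behavior described above can be sketched as a cycle-by-cycle simulation of one sender, one receiver, and a bounded FIFO queue between them. The cycle model (the receiver acts before the sender within a cycle, and the sender tries to write on every cycle) and all names are assumptions made for illustration.

```python
from collections import deque


def simulate(queue_depth, n_words, receiver_ready):
    """Simulate a sender and receiver coupled by a bounded FIFO queue.

    receiver_ready(cycle) says whether the receiver attempts a read in that
    cycle. Returns (received words, sender stalls, receiver stalls).
    """
    queue = deque()
    received = []
    sent = sender_stalls = receiver_stalls = 0
    cycle = 0
    while len(received) < n_words:
        if receiver_ready(cycle):
            if queue:
                received.append(queue.popleft())
            else:
                receiver_stalls += 1      # read from an empty queue: stall
        if sent < n_words:
            if len(queue) < queue_depth:
                queue.append(sent)
                sent += 1
            else:
                sender_stalls += 1        # write to a full queue: stall
        cycle += 1
    return received, sender_stalls, receiver_stalls


if __name__ == "__main__":
    every_other_cycle = lambda cycle: cycle % 2 == 0
    for depth in (2, 4):
        print("depth", depth, simulate(depth, 4, every_other_cycle))
```

Words always arrive in the order they were sent. In this example, deepening the queue from 2 to 4 words removes the sender stall for the four-word burst, which is the relaxed coupling the text describes.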

A special-purpose systolic array can be tailored to a specific 
algorithm by implementing the dedicated communication 
paths directly in hardware and providing long enough queues 
to ensure a steady flow of data. As a programmable array, 
iWarp processors should implement the common systolic al- 
gorithms well, but can also degrade gracefully to cover other 
algorithms. We have identified the following requirements
for efficient support of systolic communication.

Hardware support for 1D and 2D configurations. Many
systolic algorithms in signal and image processing and in
scientific computing use 1D and 2D processor arrays [10].
iWarp can directly support such configurations in hardware. 

Multiplexing communication paths on a physical bus.
iWarp can also support other configurations, with degraded
performance, if necessary. A systolic algorithm may call for
more communication paths between a pair of connecting cells
than those provided directly by hardware. It may require
extra communication paths for configurations such as a
hexagonal array, or to implement deadlock avoidance
schemes [14]. Divide-and-conquer algorithms may require
communication between powers of two distances away at
different times of the algorithm. All these considerations
motivate the need to multiplex multiple communication paths
on a physical bus.

Coupling of computation and communication. The com- 
putation part of a cell needs to access the communication part 
directly without going through the cell's local memory. This 
extra source of data is a key to systolic algorithm's efficiency. 
For example, fine-grained systolic algorithms for important 
matrix operations can consume and produce up to four data 
words per clock. Memory bandwidth cannot match this high 
communication bandwidth. 

Spooling. Regardless of the size of the hardware queue 
available on each cell, there is always some systolic algorithm 
that requires deeper queues. For example, a systolic algo- 
rithm for convolving a kernel with a 2D image requires some 
cells to store the entire row of the image [11]. Therefore, it is
desirable to provide an automatic facility to overflow the data 
to the cell's local memory if necessary. 
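
A minimal sketch of such an overflow facility is a queue with a small fixed-depth front part and an unbounded spill area standing in for local memory. The class name, the hardware depth of 8 words, and the 512-word image row are hypothetical values chosen for the example, not iWarp parameters.

```python
from collections import deque


class SpoolingQueue:
    """FIFO queue whose excess words overflow into a memory-backed spill area."""

    def __init__(self, hardware_depth):
        self.hardware_depth = hardware_depth
        self.hardware = deque()   # stands in for the on-chip queue
        self.memory = deque()     # stands in for the cell's local memory
        self.words_spooled = 0

    def put(self, word):
        # Once anything has spilled, later words must also go to memory,
        # otherwise they would overtake the spilled words.
        if not self.memory and len(self.hardware) < self.hardware_depth:
            self.hardware.append(word)
        else:
            self.memory.append(word)
            self.words_spooled += 1

    def get(self):
        word = self.hardware.popleft()  # raises IndexError if the queue is empty
        if self.memory:
            # Refill the hardware queue from memory to preserve FIFO order.
            self.hardware.append(self.memory.popleft())
        return word


if __name__ == "__main__":
    queue = SpoolingQueue(hardware_depth=8)
    for pixel in range(512):          # one hypothetical 512-word image row
        queue.put(pixel)
    print("words spooled to memory:", queue.words_spooled)
```

The sender never blocks in this model; the cost of a queue that is too shallow shows up as memory traffic instead of stalls.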

4.1.3. Reserving a communication subnetwork

In both message passing and systolic communication 
models, there is a need to multiplex multiple communication 
paths onto a physical bus. Efficient support of communica- 
tion paths requires dedicated hardware resources, thus only a 
small number of paths can be provided. This resource limita- 
tion raises the issue of resource management. 

The need for managing the communication resource is more 
pronounced in the systolic communication model. This is 
because the production and consumption rates of a data
stream are tied directly to the computation rates of the cells.
As the computation on a cell can stall and even deadlock 
while waiting for data, the lifetime of a communication path 
can be arbitrarily long. Although an idle communication path 
does not consume any communication bandwidth, if all the 
multiplexed paths on a physical bus are occupied, no other 
traffic can get through. 

In message passing, an entire message is first prepared and
buffered in the sender cell's local memory, and the message is
stored into the destination cell's local memory directly. Once 
a communication path becomes available, the data can be 
spooled in and out of the memories. With a proper routing 
scheme to avoid deadlocks a message can always get through, 
although it may have to wait for a while if the network is 
backed up with long messages. 

Both models can benefit from a mechanism to reserve a set 
of communication paths for a class of messages. For instance, 
we can reserve a set of communication paths for system 
messages for purposes such as synchronization, program
debugging and code downloading. First of all, this guarantees
that the system can reach all the cells even if the user uses up
all other paths. Moreover, since the system has full control
over all messages on the reserved network, the behavior of the
network is more predictable, and attributes such as a 
guaranteed response time are possible. 

4.2. Messages and pathways

Message passing and systolic communication are two very
different kinds of communication. The former supports coarse-grain
parallelism where processes at different cells behave indepen- 
dently, and the latter supports fine-grain parallelism where 
processes at different cells cooperate synchronously. 
However, on examining the requirements to make message 
passing efficient and systolic communication general, they are 
not that dissimilar. For example, wormhole routing uses up
multiple hardware links simultaneously, much like a
communication path in systolic communication that connects two
non-neighboring processes. On iWarp, both communication
models can be unified and supported efficiently by the same
programming abstractions of a pathway and a message, 
defined below. 

A pathway is a direct connection from a cell (called the 
source cell) to another cell (called the destination cell). Each
segment of the pathway that connects the communication 
agent of a cell to the computation agent of the same cell or to 
the communication agent of another cell is called a pathway 
segment. (See Figure 9 for examples of pathways.) 

A message consists of a header, a sequence of data words, 
and a marker denoting the end of the message. Messages can
be sent from the source cell to the destination cell over a 
pathway. The pathway is initiated and terminated by the 
source cell. It assembles a header containing a destination 
address and additional routing information and hands it to the 
communication agent. The source cell closes the pathway by
sending a special marker to signal the end. 

Normally, one pathway is set up for each individual mes- 
sage. The source cell opens a pathway to the destination cell, 
sends its message, and then closes the pathway. In the message
passing model, the sending process dynamically creates
a new pathway and message for each data transfer. Not all 
intermediate links of a pathway need to exist at the same time. 
In wormhole routing, the marker denoting the end of the
pathway may have reached an intermediate cell even before 
the header and data reach the destination. In the systolic 
communication model, the cells typically set up required 
pathways for a longer duration. The sending program trans- 
mits individual data words along a pathway as they are 
generated without sending any additional headers or markers. 
On termination, the cell programs close the message and the
pathway. 

However, it is possible that a pathway is set up for multiple 
messages. That is, the sender cell does not take down the 
pathway immediately after the first message has passed 
through, so the sender can send further messages over the
same pathway. The sender cell has reserved the pathway for 
its future use. 

Reservation of multiple pathways is also possible on iWarp. 
Two pathways are said to be connected if the destination cell 
of one is the source of the other. The sender of the first 
pathway can send messages to the destination of the second
using both pathways. In this way, a cell can send messages to 
multiple destinations through a set of reserved pathways. 

5. iWarp communication

This section describes how the communication agent on
iWarp implements the above programming abstractions and
satisfies the performance requirements of both the message
passing and systolic communication models. We break down 
the functionality of the communication agent into four 
categories. The categories and the requirements they fulfill 
are summarized as follows: 

1. Physical communication network: Hardware
support for 1D and 2D configurations.

2. Logical communication network: A
mechanism to multiplex multiple pathway
segments on a physical bus on a word level basis.

3. Pathway unit: A mechanism to establish
pathways.

4. Streaming and spooling unit: Direct access to the
communication agent from the computation
agent, and a spooling mechanism to transfer
data from and to local memory.

5.1. Physical communication network 

The communication network of iWarp is based on a set of 
high bandwidth point-to-point physical busses, linking the 
input and output ports of a pair of cells. Each cell has four 
input and four output ports, allowing cells to be connected in 
various topologies. Figure 8(a) illustrates the 2D array
configuration where each cell is connected bidirectionally to
four cells.

Each physical bus can transmit one 32-bit data word every
100 ns. The VLSI custom chip implementation made it 
possible to have fine-grain, word level handshaking without 
any synchronization delay. Thus, each physical bus has a data 
bandwidth of 40 MBytes/sec, giving an aggregate data trans- 
fer rate of 320 MBytes/sec. 
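
The bandwidth figures above follow from simple arithmetic, reproduced here as a short Python check. The only assumption is that the aggregate rate counts all eight physical busses of a cell (four input plus four output) and that one MByte means 10^6 bytes.

```python
WORD_BYTES = 4        # one 32-bit word
WORD_TIME_NS = 100    # one word every 100 ns on each physical bus
INPUT_BUSSES = 4
OUTPUT_BUSSES = 4     # assumption: the aggregate counts inputs and outputs


def bus_bandwidth_mbytes_per_sec(word_bytes: int, word_time_ns: int) -> float:
    """Bandwidth of one physical bus in MBytes/sec (1 MByte = 10**6 bytes)."""
    words_per_sec = 1e9 / word_time_ns
    return word_bytes * words_per_sec / 1e6


per_bus = bus_bandwidth_mbytes_per_sec(WORD_BYTES, WORD_TIME_NS)
aggregate = per_bus * (INPUT_BUSSES + OUTPUT_BUSSES)
print(per_bus, aggregate)
```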

5.2. Logical communication network 

Besides the four input and four output external busses 
described above for connecting to other cells, a communica- 
tion agent is also connected to the computation agent in its 
cell through two input and two output internal busses. Each 
of these busses can be multiplexed on a word level basis to 
support a number of logical busses in the same direction, 
whereas each logical bus can implement one pathway segment
of a pathway at a time.

To the communication agent on a cell, there are four kinds
of logical busses:

1. incoming busses from the communication agent of a
neighboring cell,

2. incoming busses from the computation agent of the
same cell,

3. outgoing busses to the communication agent of a
neighboring cell, and

4. outgoing busses to the computation agent of the
same cell.

The mapping of the logical busses to physical busses is
performed statically, under software control. The hardware
allows the total number of incoming logical busses in the
communication agent of each cell to be as large as 20. For
example, in a 2D array, the logical busses can be evenly
distributed among the four neighbors and the computation
agent, as shown in Figure 8(b). In this case, the heart of the
communication agent is a 20x20 crossbar that links incoming
logical busses to outgoing logical busses. Logical busses are
managed by the source, i.e., the sending cell. The sender can
initiate communication using any of its pre-allocated free
logical busses without consulting the receiver. This design
minimizes the time needed to set up a pathway between cells.
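
A minimal sketch of the even distribution and of sender-side management is given below in Python. It is a software model under stated assumptions, not the hardware mechanism: the source labels, the numbering of logical busses, and the names distribute_evenly and SenderBusPool are all hypothetical.

```python
from typing import Dict, List

MAX_INCOMING_LOGICAL_BUSSES = 20  # hardware limit quoted in the text
# Hypothetical labels for the four neighbors and the local computation agent.
SOURCES = ["north", "east", "south", "west", "computation_agent"]


def distribute_evenly(total: int, sources: List[str]) -> Dict[str, List[int]]:
    """Statically assign an equal share of logical bus numbers to each source."""
    share, remainder = divmod(total, len(sources))
    if remainder:
        raise ValueError("total does not divide evenly among the sources")
    return {
        name: list(range(i * share, (i + 1) * share))
        for i, name in enumerate(sources)
    }


class SenderBusPool:
    """Free logical busses toward one receiver, managed by the sender alone."""

    def __init__(self, bus_ids: List[int]) -> None:
        self.free = list(bus_ids)

    def acquire(self) -> int:
        # No handshake with the receiver: the sender takes any free bus.
        if not self.free:
            raise RuntimeError("no free logical bus toward this receiver")
        return self.free.pop(0)

    def release(self, bus_id: int) -> None:
        self.free.append(bus_id)


# Incoming busses as seen by the receiving cell.
mapping = distribute_evenly(MAX_INCOMING_LOGICAL_BUSSES, SOURCES)
# The busses arriving from the west are owned by the west neighbor (the sender).
pool = SenderBusPool(mapping["west"])
first = pool.acquire()
print(len(mapping["west"]), first)
```

Because the pool lives entirely on the sending side, acquiring a bus needs no round trip to the receiver, which is the property the text credits for fast pathway setup.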




Figure 8. (a) Physical communication network 
(b) logical busses of a cell 

5.3. The pathway unit

A pathway is formed by connecting a sequence of pathway 
segments together. Figure 9 contains an example of three 
pathways through some cells in a 2D array. Pathway 1 
connects the computation agent of B to that of A through a 
pathway segment between the communication agents of the 
two cells. Pathway 2 passes through cell A, turns a corner at
cell C and finally reaches the destination D. Lastly,
pathway 3 passes through both cells C and D. Two pathway
segments are multiplexed on the physical bus from cell C to
cell D.

A new pathway is established by the use of a special open
pathway marker. As the communication agents pass the open 
pathway marker along from cell to cell, they allocate 
resources to form the pathway. 

Open pathway markers carry addresses to tell how to route
them from their source to their destination, using streetsign
routing (e.g., "go to Jones and stop"). There are two
components of such a streetsign: the streetname (e.g., Jones) and
the associated action (e.g., stop). The controller provides
special hardware support for address recognition with a
multiple entry Address Match CAM. For example, a pathway
route might consist of "go to Jones, turn right, go to Smith,
turn left, go to Johnson, and stop". Each pathway unit is
responsible for recognizing addresses of open pathway
markers requiring service or attention at that cell. Given the
sequential nature of streetsign routing interpretation, a given
communication agent needs to deal with only the "next"
streetname on passing open pathway markers.

Upon the arrival of an open pathway marker, the pathway 
unit interprets the address to see if it is addressed to this cell, 
and, if so, posts an event to the computation agent to invoke 
the appropriate routine. Otherwise, it finds a free outgoing 
logical bus along the route given by the header and connects it 
to the incoming logical bus. The pathway is dismantled, one
link at a time, by the flow of a close pathway marker along
the pathway, cell to cell from the source to the destination.
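
Streetsign routing of an open pathway marker can be sketched as follows. This Python model is an assumption-laden illustration: cells are (row, column) coordinates, the marker keeps its heading until a cell's address-match set (standing in for the Address Match CAM) recognizes the next streetname, and the function name route_open_marker and the grid layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]          # (row, col); rows grow southward
Heading = Tuple[int, int]       # (d_row, d_col)
Streetsign = Tuple[str, str]    # (streetname, action): action is right, left or stop

# Turning right rotates the heading clockwise; turning left is the inverse.
RIGHT_OF: Dict[Heading, Heading] = {
    (0, 1): (1, 0), (1, 0): (0, -1), (0, -1): (-1, 0), (-1, 0): (0, 1),
}
LEFT_OF: Dict[Heading, Heading] = {v: k for k, v in RIGHT_OF.items()}


def route_open_marker(start: Cell, heading: Heading, route: List[Streetsign],
                      cam: Dict[Cell, Set[str]], max_hops: int = 64) -> List[Cell]:
    """Return the cells visited by an open pathway marker, source first."""
    if not route or route[-1][1] != "stop":
        raise ValueError("a route must end with a 'stop' streetsign")
    pending = list(route)
    cell, path = start, [start]
    for _ in range(max_hops):
        cell = (cell[0] + heading[0], cell[1] + heading[1])
        path.append(cell)
        streetname, action = pending[0]        # only the "next" streetname matters
        if streetname not in cam.get(cell, set()):
            continue                           # not addressed here: pass straight through
        pending.pop(0)
        if action == "stop":
            return path                        # destination reached
        heading = RIGHT_OF[heading] if action == "right" else LEFT_OF[heading]
    raise RuntimeError("marker did not reach its destination")


cam = {(0, 2): {"Jones"}, (2, 2): {"Smith"}, (2, 4): {"Johnson"}}
route = [("Jones", "right"), ("Smith", "left"), ("Johnson", "stop")]
path = route_open_marker((0, 0), (0, 1), route, cam)
print(path)
```

Each cell inspects only the first pending streetsign, which mirrors the observation that a communication agent deals with only the "next" streetname. A close pathway marker would later traverse the same list of cells in the same order, releasing one link at a time.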

The latency of communication through a cell is 100 ns 
normally, and 150 ns in the case of corner turning. The
interpretation of addresses and the establishment of pathways 
are completely performed by hardware. Creating a new path- 
way segment does not incur any additional time delay. 
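
As a rough guide, the per-cell figures can be combined into an end-to-end estimate. The Python sketch below assumes that the per-cell latencies simply add along a pathway, which the text does not state explicitly; the function name is hypothetical.

```python
STRAIGHT_NS = 100   # latency through a cell without a turn
CORNER_NS = 150     # latency through a cell that turns a corner


def pathway_latency_ns(straight_cells: int, corner_cells: int) -> int:
    """Estimated latency of a word crossing the given cells (assumes additivity)."""
    return straight_cells * STRAIGHT_NS + corner_cells * CORNER_NS


print(pathway_latency_ns(straight_cells=3, corner_cells=2))
```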

[Figure: 2D array with cells A and B in the top row, C and D in the bottom row]

Figure 9. Pathways in a 2D array

5.4. The streaming and spooling unit 

The computation agent can get access to the communication 
data by (1) directly accessing the communication agent a 
word at a time, or (2) spooling the data in and out of local 
memory using special hardware support. 

Programs can read data from a message or write data to a 
message via the side effects of special register references. 
These special registers are called streaming gates, because
they provide a "gating" or "windowing" function allowing a
stream of data to pass, word by word, between the communication
agent and the computation agent. There are two
input gates and two output gates. These gates can be bound to 
different logical busses dynamically. A read from the gate 
will consume the next word of the associated input message; 
correspondingly, a write to an output gate will generate the 
next word of the associated output message. Data word-at-a-time
synchronization is expressed in algorithms by the side
effects of gate register references (e.g., a read of an input gate 
at which no data is available causes the instruction to spin 
until the data arrives). 
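
The gate semantics can be imitated in software with blocking queues. The Python sketch below is an analogy only: on iWarp the gates are registers and a stalled read spins in hardware, whereas here a logical bus is modeled by queue.Queue and the cell program runs in a thread. The class names and the sample loop body (V1 * V2 + C[i], forwarding V2) are illustrative choices.

```python
import queue
import threading
from typing import List


class InputGate:
    """Reading consumes the next word of the bound input message."""

    def __init__(self) -> None:
        self._bus = queue.Queue()

    def bind(self, logical_bus: queue.Queue) -> None:
        self._bus = logical_bus          # gates can be rebound dynamically

    def read(self) -> float:
        return self._bus.get()           # waits until a word is available


class OutputGate:
    """Writing generates the next word of the bound output message."""

    def __init__(self) -> None:
        self._bus = queue.Queue()

    def bind(self, logical_bus: queue.Queue) -> None:
        self._bus = logical_bus

    def write(self, word: float) -> None:
        self._bus.put(word)


def cell_program(in1: InputGate, in2: InputGate,
                 out_result: OutputGate, out_v2: OutputGate, c: List[float]) -> None:
    for c_i in c:
        v1 = in1.read()
        v2 = in2.read()
        out_result.write(v1 * v2 + c_i)  # send the result onward
        out_v2.write(v2)                 # pass V2 to the next cell


bus_a, bus_b, bus_out, bus_fwd = (queue.Queue() for _ in range(4))
in1, in2, out1, out2 = InputGate(), InputGate(), OutputGate(), OutputGate()
in1.bind(bus_a); in2.bind(bus_b); out1.bind(bus_out); out2.bind(bus_fwd)

worker = threading.Thread(target=cell_program, args=(in1, in2, out1, out2, [10, 20, 30]))
worker.start()                           # the first read waits: no data has arrived yet
for v1, v2 in zip([1, 2, 3], [4, 5, 6]):
    bus_a.put(v1)
    bus_b.put(v2)
worker.join()
results = [bus_out.get() for _ in range(3)]
forwarded = [bus_fwd.get() for _ in range(3)]
print(results, forwarded)
```

No explicit synchronization appears in cell_program; the ordering comes entirely from the side effects of the gate reads and writes.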



iWarp also provides a transparent, low overhead 
mechanism for transferring data between the pathway unit and 
the local memory via spooling gates. Spooling has low over- 
head to avoid significant reduction of the efficiency of any 
ongoing or parallel computation. Spooling is transparent except
for delays incurred due to either cycle stealing (i.e., for
address computation) or local memory access interference
from other memory references (i.e., due to concurrent cache 
or instruction activities). 

6. iWarp computation

The iWarp processor is designed to execute numerical
computations with a high sustained floating-point arithmetic rate.
The iWarp cell has a high peak computation rate of 20
MFLOPS for single precision and 10 MFLOPS for double
precision. More importantly, iWarp can attain a high
computation rate consistently. This is because the multiple
functional units in the computation agent are directly accessible
through a long instruction word (LIW) instruction. By
translating the user's code directly into these long instructions using
an optimizing compiler [16], a high computation rate can be
achieved for all programs, vectorizable or not.

6.1. The computation agent 

The computation agent has been optimized for LIW controlled,
parallel operation of multiple functional units. Chief
among these optimizations are:

• nonpipelined floating-point arithmetic units,

• inter-unit and intra-unit, output to input, operand
bypassing,

• parallel, hardware supported, zero-overhead looping,

• large, shared, multi-ported register file,

• a high bandwidth, low latency (no striding penalty)
memory,

• high bandwidth, low latency interface with the
communication agent.

The LIW workhorse instruction of iWarp is called the
ComputeAndAccess (C&A) instruction. As an example of the
parallelism available, a loop with code

FOR i := 0 TO n-1 DO BEGIN
    f := (A[i]*B[3*i])+f;
END;

is compiled into a loop that initiates one iteration every cycle
surrounded by a loop prologue and epilogue to get the
iterations started. Similarly, a loop body that reads a value V1
from one message, V2 from another message, computes
V1 * V2 + C[i] and sends the result as well as V2 on to the
next processor is also translated into a single C&A instruction
in the loop body. The single precision C&A instruction
executes in two clocks, and the double precision C&A
instruction executes in four clocks, so both loops execute at the
peak computation rate of the processor.
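
The relation between the C&A timing and the peak rates can be checked with a few lines of Python. The 20 MHz clock used here is an assumption inferred from the quoted figures (two floating-point operations every two clocks giving 20 MFLOPS); it is not stated in this section.

```python
CLOCK_MHZ = 20.0    # assumption: inferred from the quoted peak rates
FLOPS_PER_CA = 2    # one multiplication and one addition per C&A instruction


def peak_mflops(clocks_per_instruction: int) -> float:
    """Peak rate when one C&A instruction completes every given number of clocks."""
    return FLOPS_PER_CA / clocks_per_instruction * CLOCK_MHZ


single = peak_mflops(2)   # single precision C&A: two clocks
double = peak_mflops(4)   # double precision C&A: four clocks
print(single, double)
```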

A C&A instruction requires up to 8 operands and produces 
up to 4 results. Memory accesses may produce or consume 
up to two of those operands: either a read and a write or two 
reads. Each memory reference includes an address computation
(e.g., an indexing operation with a non-unit stride). The
C&A instruction employs a read-ahead/write-behind pipeline
that makes memory read operands from one instruction avail- 
able for use in the next. Conversely, computational results of 
one instruction are written to memory during the next. 

Those operand references that are not satisfied by the
memory read operation or read from a gate (see Section 5.4)
must be to the register file. These operands may themselves
be the results of previous operations (e.g., intermediate results
held in the register file). To avoid any interinstruction
latencies, the results of the integer unit or a floating-point unit
may be "bypassed" directly back to that unit as an input
operand, without waiting for the destination register file
location to be updated. Also, the results of either floating-point
unit may be "bypassed" directly to the other floating-point
unit (e.g., to support multiply-accumulate sequences).

Thus, the execution of a single C&A instruction can include
up to one floating-point multiplication, one floating-point
addition, two memory accesses (including two integer
operations for addressing), four gate accesses, several more register
accesses (enough to provide the rest of the required operands), 
and branching back to the beginning of the loop. 

Incremental to the single "long" C&A instruction, the 
iWarp computation agent provides a full complement of 
"short" instructions. They can be thought of as 2 and 3 
address RISC-like instructions. These "short" instructions 
are provided to make iWarp a generally programmable 
processor. They usually control only a single functional unit. 

6.2. Livermore Loops performance

The Livermore Loops [6], a set of computational kernels
typically found in scientific computing, have been used since 
the 1960's as a benchmark for computer systems. The loops 
range from having no data dependence between iterations 
(easily vectorizable) to having only a single recurrence 
(strictly sequential). This combination of vector and scalar 
code provides a good measure on the performance of a 
machine across a spectrum of scientific computing require- 
ments. 

The Livermore Loops were manually translated from
FORTRAN to W2. (W2 is a Pascal-like language developed
for the Warp machine. The retargeted W2
compiler [8, 16, 17] has been used as a tool in developing and 
evaluating the iWarp architecture.) The translation into W2 
was straightforward, preserving loop structures and changing 
only syntax, except for kernels 15 and 16, which were trans- 
lated from Feo's restructured loops [6]. 

The performance of Livermore Loops (double precision) on
a single iWarp processor is presented in Table 6-1. The
unweighted mean is 4.2 MFLOPS, the standard deviation is
2.6 MFLOPS and the harmonic mean is 2.7 MFLOPS. Since
the machine's peak double-precision performance is 10
MFLOPS, these numbers demonstrate a highly effective use
of the raw computation power of iWarp.
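
For readers unfamiliar with the summary statistics, the Python sketch below shows how they are computed. The rates in it are illustrative placeholders, not the measured values of Table 6-1, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics

# Illustrative placeholder rates in MFLOPS; NOT the values of Table 6-1.
rates = [1.0, 2.0, 4.0, 8.0]

mean = statistics.mean(rates)                     # unweighted (arithmetic) mean
harmonic = statistics.harmonic_mean(rates)        # dominated by the slow kernels
spread = statistics.pstdev(rates)                 # assumption: population form

print(mean, harmonic, spread)
```

The harmonic mean never exceeds the arithmetic mean and is pulled down by the slowest kernels, which is why it is reported alongside the unweighted mean.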

The iWarp cell is a scalar processor, and does not require
that loops be vectorizable for full utilization of its floating- 
point units. This is why it does not exhibit the same tremen- 
dous disparity in MFLOPS rates for the different loops as do 
vector machines. Nonetheless, the variation between the 
MFLOPS rates obtained is still significant. Near peak perfor- 
mance can be achieved (using a high-level language and an 
optimizing compiler), as demonstrated by kernels 3 and 7. On 
the other hand, performance of near 1 MFLOPS is also ob- 
served. The factors that limit iWarp performance are data 
dependency and the critical resource bottleneck. 

Table 6-1: Double precision performance of
Livermore Loops on a single iWarp cell



Data dependency. Consider kernel 5:

FOR i := 0 TO n-1 DO BEGIN
    X[i] := Z[i] * (Y[i] - X[i-1]);
END;

The multiplications and additions are serialized because of the
data dependencies. Just by this consideration alone, iWarp is
limited to a peak performance of 5 MFLOPS on this loop.
However, iWarp still executes data dependent code better
than vector machines. The floating-point units are not
pipelined, and there is no penalty on non-unit stride memory
accesses. More importantly, not all data dependencies force
the code to be serialized. As long as the loop contains other
independent floating-point operations, the floating-point units
can still be utilized. This is a unique advantage an LIW
architecture has over vector machines.
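
The 5 MFLOPS bound can be reproduced under two assumptions that this section does not state outright: a 20 MHz clock, and four clocks per double precision floating-point operation (matching the double precision C&A time). The Python sketch also restates the kernel 5 recurrence; the loop starts at index 1 here so that X[i-1] is always a valid element.

```python
from typing import List

CLOCK_MHZ = 20.0        # assumption: inferred from the quoted peak rates
CLOCKS_PER_DP_OP = 4    # assumption: one double precision operation takes four clocks


def kernel5(x: List[float], y: List[float], z: List[float]) -> List[float]:
    """Recurrence of kernel 5: each X[i] needs the X[i-1] just computed."""
    for i in range(1, len(x)):
        x[i] = z[i] * (y[i] - x[i - 1])
    return x


def serialized_bound_mflops(flops_per_iteration: int, chain_ops: int) -> float:
    """Bound when chain_ops dependent operations must run one after another."""
    clocks_per_iteration = chain_ops * CLOCKS_PER_DP_OP
    return flops_per_iteration / clocks_per_iteration * CLOCK_MHZ


# The subtraction waits for X[i-1]; the multiplication waits for the subtraction.
bound = serialized_bound_mflops(flops_per_iteration=2, chain_ops=2)
print(kernel5([1.0, 0.0, 0.0], [0.0, 3.0, 5.0], [0.0, 2.0, 2.0]), bound)
```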

Critical resource bottleneck. The execution speed of a
program is limited by the most heavily used resource. Unless 
both the floating-point multiplier and adder are the most criti- 
cal resources, the peak MFLOPS rate cannot be achieved. 
Programs containing no multiplications cannot run faster than 
5 MFLOPS since the multiplier is idle all the time. For 
example, for kernel 13 since the integer unit is the most 
heavily used resource, the MFLOPS measure is naturally low. 
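
A simple way to reason about the bottleneck is to take the busiest resource as the iteration time. The Python sketch below is a simplified model, not the authors' analysis: it assumes a 20 MHz clock, four clocks per double precision operation, and hypothetical per-iteration busy times for each unit.

```python
from typing import Dict

CLOCK_MHZ = 20.0  # assumption: inferred from the quoted peak rates


def bottleneck_mflops(flops_per_iteration: int, busy_clocks: Dict[str, int]) -> float:
    """Rate when each iteration takes as long as its most heavily used resource."""
    critical = max(busy_clocks.values())
    return flops_per_iteration / critical * CLOCK_MHZ


# Hypothetical per-iteration busy times, in clocks, for three loop shapes.
balanced = bottleneck_mflops(2, {"multiplier": 4, "adder": 4, "integer": 2})
add_only = bottleneck_mflops(1, {"multiplier": 0, "adder": 4, "integer": 2})
integer_bound = bottleneck_mflops(2, {"multiplier": 4, "adder": 4, "integer": 16})
print(balanced, add_only, integer_bound)
```

Only the balanced case reaches the 10 MFLOPS double precision peak; the addition-only loop tops out at 5 MFLOPS, and a loop limited by the integer unit falls lower still.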

7. Summary and conclusions 

iWarp is the first of a new class of parallel computer
architectures. iWarp integrates both the computation and
communication functionalities into a single VLSI component.
The communication models supported range from large-grain 
message passing to fine-grain systolic communication. 

The computation agent of an iWarp component contains 
floating-point units with a peak performance of 20 and 10 
MFLOPS for single and double precision operations, respec- 
tively, as well as an integer unit that performs 20 million 
integer or logical operations per second. The communication 
agent operates independently of the computation agent; since 
both are implemented on a single chip, tight coupling between 
communication and computation is possible. This permits 
efficient systolic communication, as well as low-overhead 
message passing. 

iWarp is designed to be a building block for high performance
parallel systems. Not only does iWarp have impressive
computational capabilities, it also has exceptional
communication capabilities, making iWarp suitable for both
scientific computing and high speed signal processing. The first
iWarp based systems will be 1D arrays, rings, 2D arrays or
tori, but the iWarp component is flexible enough to be used in
numerous other organizations.

We anticipate iWarp to have a significant impact on the 
practice of parallel computing. Arrays of thousands of cells 
are feasible, programmable, and much cheaper than many
other supercomputers of comparable power. iWarp systems
can have a variety of goals: they can be special or general
purpose, and experimental or commercial. The support of
well accepted languages for the cell like FORTRAN and C,
together with parallel program generators to simplify the
programming of the array, make it possible to program the
diverse parallel machines that can be realized with iWarp 
components. 

The iWarp component has to meet the diverse requirements 
of fine-grain and coarse-grain communication for various
applications including scientific computing and signal
processing. The design of the iWarp component has convinced us
that these requirements are not incompatible and, in fact, do 
reinforce each other. The high bandwidth/low latency com- 
munication mechanism in iWarp implements both message 
passing and systolic communication efficiently. This synergy 
makes iWarp a suitable building block for the affordable 
supercomputing systems of the future.